Conversation

@tedzhouhk tedzhouhk commented Nov 6, 2025

  • add parallelization filters based on model architecture information
  • refactor the profiling script to be cleaner and ready for TEP/DEP support
  • update tests

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for parallelization mapping configuration in model profiling with per-GPU sweeps for prefill and decode phases.
    • Implemented context length cap customization during profiling, prioritizing user-provided limits.
    • Enhanced profiling plots with mapping labels for improved visibility of configuration variants.
  • Improvements

    • Enhanced model information extraction with comprehensive metadata (context length, expert count, cache parameters).
    • Improved automatic GPU discovery and search space generation logic.
    • Restructured configuration handling with clearer separation of concerns.
  • Refactor

    • Streamlined configuration modification protocol for consistency across model backends.

@tedzhouhk tedzhouhk requested review from a team as code owners November 6, 2025 02:07
@github-actions github-actions bot added the feat label Nov 6, 2025
coderabbitai bot commented Nov 6, 2025

Walkthrough

This pull request refactors the profiler system to centralize model metadata into a ModelInfo dataclass, introduces parallel-mapping support for per-GPU profiling sweeps, moves ConfigModifierProtocol to a dedicated protocol module, and updates the search-space auto-generation logic with clearer GPU-discovery control flow and conditional compilation.

Changes

  • Model Information Centralization (benchmarks/profiler/utils/model_info.py): Introduces a new ModelInfo Pydantic dataclass with fields for model_size, is_moe, max_context_length, num_experts, intermediate_size, num_kv_heads, and quantization_block_size. Updates the get_model_info return type from dict to ModelInfo and extracts additional metadata from model configs.
  • Configuration Protocol Restructuring (benchmarks/profiler/utils/config.py, benchmarks/profiler/utils/config_modifiers/__init__.py, benchmarks/profiler/utils/config_modifiers/protocol.py): Removes ConfigModifierProtocol from config.py and relocates it to the new protocol.py module. Updates the TYPE_CHECKING import in __init__.py to reference the new location. The protocol defines classmethods for config conversion, size setting, model naming, port allocation, and image/model updates.
  • Parallel Mapping Infrastructure (benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py): Adds a new module with a ParallelizationMapping frozen dataclass (tp, tep, dep fields) and two functions: get_candidate_parallel_mappings, which validates candidates via divisibility rules for KV heads, experts, and intermediate_size, and apply_parallel_mapping_to_config, which applies the appropriate size modifier.
  • Config Modifier Enhancements (benchmarks/profiler/utils/config_modifiers/sglang.py): Prioritizes the --model-path argument when deriving the model name in SGLangConfigModifier.get_model_name, falling back to --served-model-name if not found.
  • Search Space and GPU Discovery Refactoring (benchmarks/profiler/utils/search_space_autogen.py): Introduces centralized model_info resolution using the new ModelInfo type, replaces MOE_MODEL_MAX_NUM_GPUS with MOE_MODEL_MAX_NUM_GPU_FACTOR, refactors GPU discovery logic with an explicit enable_gpu_discovery flag gating computation, and unifies the config loading flow.
  • Profiler CLI and Argument Parsing (benchmarks/profiler/utils/profiler_argparse.py): Changes the default GPU-per-engine bounds to 0 for both min and max, adds a --num-gpus-per-node CLI argument with default 0, removes the is_moe_model CLI argument, and makes auto-generation unconditional.
  • Plotting and Visualization (benchmarks/profiler/utils/plot.py): Expands the plot_prefill_performance signature to accept explicit lists (prefill_num_gpu, prefill_ttft, prefill_thpt_per_gpu) and a new parallel_mapping_labels parameter. Updates plot_decode_performance to optionally handle 4-tuple decode_results with mapping labels for annotated point labeling.
  • Main Profiler Logic (benchmarks/profiler/profile_sla.py): Major refactoring: replaces args.is_moe_model with args.model_info.is_moe, introduces per-GPU parallel-mapping sweeps with candidate-mapping computation and validation, adds context-length cap customization logic, updates results handling to track per-mapping metadata, and refactors the deployment and AI-configurator paths to apply parallel mappings.
  • Test Fixtures and Utilities (tests/profiler/test_profile_sla_aiconfigurator.py, tests/profiler/test_profile_sla_dryrun.py): Replaces is_moe_model flags and scattered max_context_length attributes with explicit ModelInfo instances in Args fixtures. Adds enable_gpu_discovery flags and updates the mocked model_info to return ModelInfo instead of a dict.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • benchmarks/profiler/profile_sla.py – Extensive rework of MoE flag usage, parallel-mapping sweep logic, results aggregation, and deployment/AI-configurator path handling; carefully verify mapping application and latency estimation flow.
  • benchmarks/profiler/utils/search_space_autogen.py – Refactored GPU discovery control flow with enable_gpu_discovery gating; verify logic for MoE-aware max_gpu computation and min_gpu derivation from model_size.
  • benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py – New module with divisibility validation rules for KV heads, experts, and intermediate_size; ensure edge cases (missing fields, warnings) are correctly handled.
  • benchmarks/profiler/utils/model_info.py – Verify metadata extraction logic handles variability in model config attribute naming (e.g., num_kv_heads vs. num_key_value_heads) and quantization_block_size detection across different quantization schemes; a hypothetical fallback sketch for this naming variability follows this list.
  • test updates – Confirm all test fixtures correctly initialize ModelInfo with expected values and that mock behaviors align with new centralized model-info resolution.
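
To illustrate the attribute-name variability flagged above, here is a minimal, hypothetical sketch of the fallback pattern one would expect in get_model_info. The helper name first_attr and the alias lists are invented for illustration and are not taken from the actual module:

    from typing import Any, Optional

    def first_attr(config: Any, *names: str) -> Optional[int]:
        # Return the first attribute present on the config, trying each alias in order.
        for name in names:
            value = getattr(config, name, None)
            if value is not None:
                return value
        return None

    # Hypothetical usage: Hugging Face configs disagree on naming, so each
    # ModelInfo field would be resolved through a list of known aliases.
    # num_kv_heads = first_attr(config, "num_key_value_heads", "num_kv_heads")
    # num_experts = first_attr(config, "num_local_experts", "num_experts")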

Poem

🐰 A rabbit hops through configs bright,
ModelInfo fields unified in sight!
Mappings dance on GPU shores,
Profiling wisdom opens doors,
Factored and refactored, all is right!

Pre-merge checks

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The PR description covers the three main changes (add parallelization filters, refactor the profiling script, update tests) but lacks the structured sections required by the template (Overview, Details, Where to start, Related Issues). Resolution: expand the description to follow the template structure: add an Overview section explaining the purpose, a Details section with more context about the changes, a "Where should the reviewer start?" section highlighting key files, and a Related Issues section with any relevant GitHub issue numbers.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The PR title "feat: add parallelization filters" directly matches the main objective of adding parallelization filters based on model architecture information, which is the primary focus of the extensive changes across multiple files.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
tests/profiler/test_profile_sla_aiconfigurator.py (2)

21-22: Drop redundant noqa. Ruff now flags # noqa: E402 as unused (RUF100). Removing the directive keeps lint clean without changing behavior. Based on static analysis hints.

-from benchmarks.profiler.utils.model_info import ModelInfo  # noqa: E402
+from benchmarks.profiler.utils.model_info import ModelInfo

26-33: Remove unused fixture argument. request is no longer used here and triggers Ruff ARG001; dropping it keeps the autouse override while satisfying lint. Based on static analysis hints.

-@pytest.fixture(autouse=True)
-def logger(request):
+@pytest.fixture(autouse=True)
+def logger():
benchmarks/profiler/utils/plot.py (1)

48-54: Docstring parameter rename. The docstring still refers to mapping_labels, but the function now accepts parallel_mapping_labels. Updating the wording keeps the docs aligned with the signature.

-        mapping_labels: optional list of strings describing parallelization mapping per point
+        parallel_mapping_labels: optional list of strings describing parallelization mapping per point
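
For orientation, a hypothetical call matching the expanded signature (values fabricated; any other parameters of the real function are omitted here):

    plot_prefill_performance(
        prefill_num_gpu=[1, 2, 4],
        prefill_ttft=[120.0, 80.0, 55.0],            # ms to first token
        prefill_thpt_per_gpu=[9.0e3, 7.5e3, 6.2e3],  # tokens/s per GPU
        parallel_mapping_labels=["TP=1", "TP=2", "TEP=4"],
    )
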
tests/profiler/test_profile_sla_dryrun.py (2)

22-24: Drop redundant noqa. Same as in the aiconfigurator tests: Ruff reports RUF100 for this # noqa: E402, so we can remove it without changing behavior. Based on static analysis hints.

-from benchmarks.profiler.utils.model_info import ModelInfo  # noqa: E402
+from benchmarks.profiler.utils.model_info import ModelInfo

30-37: Remove unused fixture argument. This autouse override doesn’t use request; trimming it avoids Ruff ARG001 while preserving the intended behavior. Based on static analysis hints.

-@pytest.fixture(autouse=True)
-def logger(request):
+@pytest.fixture(autouse=True)
+def logger():
benchmarks/profiler/profile_sla.py (1)

118-132: Simplify sweep_max_context_length logic.

The hasattr() check on line 120 is unnecessary since model_info is always a ModelInfo object with a max_context_length attribute (possibly None).

Apply this diff to simplify:

-        # Determine sweep max context length: allow user-provided cap to override model's if smaller
-        sweep_max_context_length = getattr(args, "max_context_length", None)
-        if hasattr(args, "model_info") and args.model_info is not None:
-            model_max_ctx = args.model_info.max_context_length
-            if sweep_max_context_length is None:
-                sweep_max_context_length = model_max_ctx
-            elif model_max_ctx is not None and model_max_ctx < sweep_max_context_length:
+        # Determine sweep max context length: use user-provided cap, constrained by model's maximum
+        sweep_max_context_length = getattr(args, "max_context_length", None)
+        model_max_ctx = args.model_info.max_context_length
+        if sweep_max_context_length is None:
+            sweep_max_context_length = model_max_ctx
+        elif model_max_ctx is not None and model_max_ctx < sweep_max_context_length:
                 logger.info(
                     f"User-provided max_context_length={sweep_max_context_length} exceeds model's maximum {model_max_ctx}; using model maximum."
                 )
                 sweep_max_context_length = model_max_ctx
-        if sweep_max_context_length is None:
-            logger.warning(
-                "No max_context_length available from args or model; proceeding without a cap."
-            )
+        if sweep_max_context_length is None:
+            logger.warning(
+                "No max_context_length available from args or model; proceeding without a cap."
+            )
benchmarks/profiler/utils/search_space_autogen.py (1)

85-89: Clarify error message for missing model.

The error message "No model provided, cannot auto-generate GPU search space" is misleading because line 64 already attempts to extract the model name from the config if not explicitly provided. A clearer message would indicate failure to extract or determine the model.

Consider this diff:

             if not args.model:
-                # TODO: get model info provided DGD config
-                error_msg = "No model provided, cannot auto-generate GPU search space. Please provide `--model` or GPU info"
+                error_msg = "Failed to determine model name from config. Cannot auto-generate GPU search space. Please provide `--model` explicitly or specify GPU parameters."
                 logger.error(error_msg)
                 raise RuntimeError(error_msg)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d036fc and 55379a8.

📒 Files selected for processing (12)
  • benchmarks/profiler/profile_sla.py (16 hunks)
  • benchmarks/profiler/utils/config.py (1 hunks)
  • benchmarks/profiler/utils/config_modifiers/__init__.py (1 hunks)
  • benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py (1 hunks)
  • benchmarks/profiler/utils/config_modifiers/protocol.py (1 hunks)
  • benchmarks/profiler/utils/config_modifiers/sglang.py (1 hunks)
  • benchmarks/profiler/utils/model_info.py (5 hunks)
  • benchmarks/profiler/utils/plot.py (2 hunks)
  • benchmarks/profiler/utils/profiler_argparse.py (2 hunks)
  • benchmarks/profiler/utils/search_space_autogen.py (3 hunks)
  • tests/profiler/test_profile_sla_aiconfigurator.py (2 hunks)
  • tests/profiler/test_profile_sla_dryrun.py (12 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
benchmarks/profiler/utils/config_modifiers/__init__.py (1)
benchmarks/profiler/utils/config_modifiers/protocol.py (1)
  • ConfigModifierProtocol (21-84)
benchmarks/profiler/utils/profiler_argparse.py (1)
benchmarks/profiler/utils/search_space_autogen.py (1)
  • auto_generate_search_space (32-140)
benchmarks/profiler/utils/plot.py (1)
benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py (1)
  • label (31-38)
tests/profiler/test_profile_sla_aiconfigurator.py (1)
benchmarks/profiler/utils/model_info.py (1)
  • ModelInfo (107-114)
tests/profiler/test_profile_sla_dryrun.py (1)
benchmarks/profiler/utils/model_info.py (1)
  • ModelInfo (107-114)
benchmarks/profiler/utils/config_modifiers/protocol.py (3)
components/src/dynamo/planner/defaults.py (1)
  • SubComponentType (140-142)
benchmarks/profiler/utils/config_modifiers/sglang.py (10)
  • convert_config (82-186)
  • set_config_tp_size (189-210)
  • set_config_tep_size (213-245)
  • set_config_dep_size (248-280)
  • get_model_name (283-308)
  • get_port (311-339)
  • get_kv_cache_size_from_dynamo_log (342-355)
  • load_default_config (44-46)
  • update_model (49-74)
  • update_image (77-79)
benchmarks/profiler/utils/config.py (1)
  • update_image (360-380)
benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py (3)
benchmarks/profiler/utils/model_info.py (1)
  • ModelInfo (107-114)
benchmarks/profiler/utils/config_modifiers/protocol.py (3)
  • set_config_tp_size (32-38)
  • set_config_tep_size (41-48)
  • set_config_dep_size (51-58)
benchmarks/profiler/utils/config_modifiers/sglang.py (3)
  • set_config_tp_size (189-210)
  • set_config_tep_size (213-245)
  • set_config_dep_size (248-280)
benchmarks/profiler/profile_sla.py (6)
benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py (4)
  • ParallelizationMapping (22-38)
  • apply_parallel_mapping_to_config (156-172)
  • get_candidate_parallel_mappings (41-153)
  • label (31-38)
benchmarks/profiler/utils/profile_cache.py (4)
  • check_prefill_results_exist (26-53)
  • load_existing_prefill_results (91-108)
  • check_decode_results_exist (56-88)
  • load_existing_decode_results (111-138)
benchmarks/profiler/utils/estimate_perf.py (4)
  • estimate_prefill_perf (132-153)
  • get_max_batch_size (155-209)
  • estimate_perf (74-130)
  • get_max_kv_tokens (211-231)
deploy/utils/dynamo_deployment.py (5)
  • DynamoDeploymentClient (98-495)
  • create_deployment (220-285)
  • get_deployment_logs (444-475)
  • get_service_url (212-218)
  • delete_deployment (477-495)
benchmarks/profiler/utils/plot.py (1)
  • plot_prefill_performance (36-84)
benchmarks/profiler/utils/profile_decode.py (2)
  • get_num_request_range (25-37)
  • profile_decode_aiconfigurator (143-171)
benchmarks/profiler/utils/search_space_autogen.py (5)
benchmarks/profiler/utils/model_info.py (2)
  • ModelInfo (107-114)
  • get_model_info (117-229)
benchmarks/profiler/utils/config_modifiers/sglang.py (1)
  • get_model_name (283-308)
benchmarks/profiler/utils/config_modifiers/trtllm.py (1)
  • get_model_name (282-301)
benchmarks/profiler/utils/config_modifiers/vllm.py (1)
  • get_model_name (231-250)
deploy/utils/gpu_inventory.py (1)
  • get_gpu_summary (387-417)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4144/merge) by tedzhouhk.
benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py

[error] 1-1: pre-commit isort hook modified the file.


[error] 1-1: pre-commit black hook reformatted the file.


[error] 134-134: Ruff: Ambiguous variable name: I (E741). Consider using a more descriptive name.

🪛 Ruff (0.14.3)
tests/profiler/test_profile_sla_aiconfigurator.py

21-21: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)


26-26: Unused function argument: request

(ARG001)

tests/profiler/test_profile_sla_dryrun.py

22-22: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py

136-136: Ambiguous variable name: I

(E741)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (14)
benchmarks/profiler/utils/model_info.py (4)

107-114: LGTM: Clean data container with appropriate types.

The ModelInfo Pydantic model provides a well-structured container for model metadata with clear field types and optional defaults.
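
For reference, a shape-compatible sketch of this container, assuming pydantic.BaseModel with Optional fields for metadata that not every model config provides (the exact base class, types, and defaults may differ in the actual module):

    from typing import Optional
    from pydantic import BaseModel

    class ModelInfo(BaseModel):
        model_size: int                               # total parameter count (assumed unit)
        is_moe: bool
        max_context_length: Optional[int] = None
        num_experts: Optional[int] = None
        intermediate_size: Optional[int] = None
        num_kv_heads: Optional[int] = None
        quantization_block_size: Optional[int] = None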


128-131: Good refactoring: eliminates config mutation.

Using a local variable instead of mutating the config object is cleaner and avoids side effects.


215-219: Verify that taking max() of block size list is semantically correct.

The code handles the case where quantization_block_size is a list by taking the maximum value. Ensure this is the correct interpretation for all quantization schemes that return lists (e.g., separate input/output block sizes).

Consider adding a comment explaining why max() is chosen:

-        # Handle case where block size is a list (e.g., [128, 128] for [input, output] block sizes)
+        # Handle case where block size is a list (e.g., [128, 128] for [input, output] block sizes).
+        # We take the maximum to ensure the most conservative block size constraint for validation.
         if (
             isinstance(quantization_block_size, list)
             and len(quantization_block_size) > 0
         ):
             quantization_block_size = max(quantization_block_size)

221-229: LGTM: Clean ModelInfo construction.

The return statement clearly maps all detected model attributes to the ModelInfo fields.

benchmarks/profiler/profile_sla.py (4)

437-442: LGTM: Correct attention_dp_size calculation for DEP.

The logic correctly sets attention_dp_size to num_gpus when DEP is used (data parallelism across experts), and defaults to 1 for TP-based configurations.
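
A one-line sketch of that rule (variable names hypothetical; the real code derives these values from the selected ParallelizationMapping):

    # DEP shards experts while replicating attention, so the attention
    # data-parallel size equals the GPU count; TP-based mappings shard
    # attention itself, so it stays at 1.
    attention_dp_size = num_gpus if mapping.dep is not None else 1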


537-593: LGTM: Sound mapping selection strategy.

The best mapping selection logic correctly prioritizes:

  1. Meeting latency targets (TTFT/ITL)
  2. Maximizing throughput per GPU among valid configurations

The dry-run fallback with TP mapping is appropriate.
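
A condensed sketch of that two-step selection, assuming each result row carries a mapping, a measured latency, and a per-GPU throughput (the field names here are illustrative, not the actual ones):

    # Keep only configurations that meet the latency SLA (TTFT for prefill,
    # ITL for decode), then take the highest per-GPU throughput among them.
    valid = [r for r in results if r.latency <= target_latency]
    best = max(valid, key=lambda r: r.thpt_per_gpu) if valid else None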


595-676: LGTM: Correct interpolation with best prefill mapping.

The prefill interpolation phase correctly:

  • Applies the best selected mapping to the config
  • Uses sweep_max_context_length for bounds
  • Provides appropriate tp_size fallback for AI configurator (line 629)

678-770: LGTM: Correct interpolation with best decode mapping.

The decode interpolation phase correctly:

  • Applies the best selected mapping to the config
  • Computes attention_dp_size appropriately for DEP configurations
  • Provides appropriate tp_size fallback for AI configurator
benchmarks/profiler/utils/config_modifiers/parallelization_mapping.py (3)

21-38: LGTM: Clean parallelization mapping dataclass.

The frozen dataclass design is appropriate for this value object. The label() method provides clear human-readable descriptions for logging and plotting.
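
A minimal sketch of the dataclass as described (only the tp/tep/dep fields and the frozen design are confirmed by the walkthrough; the label() formatting is a guess for illustration):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class ParallelizationMapping:
        tp: Optional[int] = None
        tep: Optional[int] = None
        dep: Optional[int] = None

        def label(self) -> str:
            # Human-readable tag for logs and plot annotations.
            for name, value in (("TP", self.tp), ("TEP", self.tep), ("DEP", self.dep)):
                if value is not None:
                    return f"{name}={value}"
            return "unmapped"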


41-66: LGTM: Clear MoE-aware candidate generation.

The function correctly generates different parallelization strategies based on model type (MoE vs. dense) and phase (prefill vs. decode).
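
A hypothetical condensation of that logic, reusing the ParallelizationMapping sketch above (the actual rules in get_candidate_parallel_mappings also cover intermediate_size, quantization block sizes, missing fields, and warnings, and the exact phase handling may differ):

    def candidate_mappings(num_gpus, info, phase):
        candidates = []
        # TP is a candidate when KV heads shard evenly across the GPUs.
        if info.num_kv_heads is None or info.num_kv_heads % num_gpus == 0:
            candidates.append(ParallelizationMapping(tp=num_gpus))
        # MoE models additionally admit expert-parallel variants when the
        # expert count divides evenly.
        if info.is_moe and info.num_experts and info.num_experts % num_gpus == 0:
            candidates.append(ParallelizationMapping(tep=num_gpus))
            if phase == "decode":
                candidates.append(ParallelizationMapping(dep=num_gpus))
        return candidates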


156-172: LGTM: Correct mapping application logic.

The function correctly delegates to the appropriate config modifier method based on the mapping type (TP/TEP/DEP) and phase.
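
A minimal sketch of that delegation (argument lists are illustrative; the real function also threads the phase through the modifier):

    def apply_parallel_mapping_to_config(config, mapping, config_modifier):
        # Dispatch to whichever size setter matches the populated field.
        if mapping.tep is not None:
            return config_modifier.set_config_tep_size(config, mapping.tep)
        if mapping.dep is not None:
            return config_modifier.set_config_dep_size(config, mapping.dep)
        return config_modifier.set_config_tp_size(config, mapping.tp)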

benchmarks/profiler/utils/search_space_autogen.py (3)

37-58: LGTM: Cleaner config loading flow.

The refactored config loading logic is more straightforward - always load the config first, then optionally update it.


60-76: LGTM: Centralized model info retrieval.

The refactoring centralizes model info retrieval and stores it in args.model_info for use throughout the profiling flow. The informative logging is helpful for debugging.


123-139: LGTM: Sensible default GPU configuration.

The fallback logic provides reasonable default values (min=1, max=4, gpus_per_node=8) when GPU discovery is disabled, with clear logging.
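
In effect, something like the following (a paraphrase of the fallback described above; attribute names mirror the CLI flags and may differ internally):

    if not args.enable_gpu_discovery:
        # 0 means "unset" for the GPU bounds, so fall back to the defaults.
        args.min_num_gpus_per_engine = args.min_num_gpus_per_engine or 1
        args.max_num_gpus_per_engine = args.max_num_gpus_per_engine or 4
        args.num_gpus_per_node = args.num_gpus_per_node or 8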

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Hongkuan Zhou <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
@keivenchang keivenchang left a comment

thanks for commenting out the test

@nv-anants nv-anants disabled auto-merge November 7, 2025 19:51
@nv-anants nv-anants merged commit 7750ed1 into main Nov 7, 2025
36 of 40 checks passed
@nv-anants nv-anants deleted the hzhou/parallel-filter branch November 7, 2025 19:52